From Regular Expression Matching to Parsing
Given a regular expression and a string, the regular expression
parsing problem is to determine if the string matches the expression and, if so,
determine how it matches, e.g., by a mapping of the characters of the string to
the characters in the expression.
Regular expression parsing makes finding matches of a regular expression even
more useful by allowing us to directly extract subpatterns of the match, e.g.,
for extracting IP-addresses from internet traffic analysis or extracting
subparts of genomes from genetic databases. We present a new general
technique for efficiently converting a large class of algorithms that
determine if a string matches a regular expression into algorithms that
can construct a corresponding mapping. As a consequence, we obtain the first
efficient linear space solutions for regular expression parsing.
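To make the difference between matching and parsing concrete, here is a minimal sketch that extracts the four fields of an IPv4-like address together with their positions in the input. It uses Python's standard `re` module, which is a backtracking engine and not the linear space technique of the paper; the pattern and the helper name `parse_ipv4` are assumptions made for this illustration.

```python
import re

# Hypothetical pattern for illustration: four dot-separated groups of 1-3 digits.
IPV4_PATTERN = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})")


def parse_ipv4(text):
    """Return (substring, (start, end)) for each group, or None if no match.

    Matching only answers yes/no; parsing also reports which characters of
    the string were matched by which part of the expression.
    """
    match = IPV4_PATTERN.fullmatch(text)
    if match is None:
        return None
    return [(match.group(k), match.span(k)) for k in range(1, 5)]


if __name__ == "__main__":
    print(parse_ipv4("192.168.0.1"))
    print(parse_ipv4("not an address"))
```

The spans are the "mapping" part of the answer: they tie each subpattern back to a range of characters in the string.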
Space-Efficient Re-Pair Compression
Re-Pair is an effective grammar-based compression scheme achieving strong
compression rates in practice. In their original paper, the authors show how
to compute the Re-Pair grammar in expected linear time, using a number of words
of working space on top of the text that is bounded in terms of the text
length, the alphabet size, and the dictionary size of the final grammar.
In this work, we propose two algorithms improving on the space of
their original solution. Our model assumes a memory word of bits and a re-writable input text composed by such words. Our
first algorithm runs in expected time and uses
words of space on top of the text for any parameter
chosen in advance. Our second algorithm runs in expected
time and improves the space to words
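For readers unfamiliar with Re-Pair, the basic scheme repeatedly replaces the most frequent pair of adjacent symbols by a fresh nonterminal until no pair occurs twice. The sketch below is a naive version that recounts all pairs in every round; it is neither the original linear-time algorithm nor either of the space-efficient algorithms described above. The function names and the `("N", id)` encoding of nonterminals are assumptions made for this illustration.

```python
from collections import Counter


def repair_compress(text):
    """Naive Re-Pair: returns (final sequence, rules), where rules maps each
    nonterminal to the pair of symbols it replaces."""
    seq = list(text)
    rules = {}
    next_id = 0
    while len(seq) >= 2:
        counts = Counter(zip(seq, seq[1:]))
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so the grammar is final
        symbol = ("N", next_id)  # nonterminals are tuples, terminals are chars
        next_id += 1
        rules[symbol] = pair
        out = []
        i = 0
        while i < len(seq):
            # Greedy left-to-right replacement of non-overlapping occurrences.
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def repair_expand(seq, rules):
    """Expand a compressed sequence back into the original text."""
    out = []
    stack = list(reversed(seq))
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            stack.append(right)
            stack.append(left)
        else:
            out.append(s)
    return "".join(out)
```

The dictionary `rules` plays the role of the grammar's dictionary, and its size is the dictionary size referred to in the abstract.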
Subsequence Automata with Default Transitions
Consider a string over a finite alphabet. The subsequence automaton of the
string (often called the directed acyclic subsequence graph) is the minimal
deterministic finite automaton accepting all subsequences of the string.
A straightforward construction
shows that the size (number of states and transitions) of the subsequence
automaton is and that this bound is asymptotically optimal.
In this paper, we consider subsequence automata with default
transitions, that is, special transitions to be taken only if none of the
regular transitions match the current character, and which do not consume the
current character. We show that with default transitions, much smaller
subsequence automata are possible, and provide a full trade-off between the
size of the automaton and the delay, i.e., the maximum number of
consecutive default transitions followed before consuming a character.
Specifically, given any integer parameter , , we
present a subsequence automaton with default transitions of size
and delay . Hence, with we
obtain an automaton of size and delay . On
the other extreme, with , we obtain an automaton of size and delay , thus matching the bound for the standard subsequence
automaton construction. Finally, we generalize the result to multiple strings.
The key component of our result is a novel hierarchical automata construction
of independent interest.
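The two ends of the size/delay spectrum can be illustrated with a small sketch. `build_subsequence_automaton` is the straightforward construction with one transition per state and character; `accepts_with_defaults` simulates a chain automaton in which state i has one regular transition (on the i-th character) and one default transition to state i+1. Both are simple illustrations written for this listing, not the hierarchical construction of the paper, and all function names are assumptions.

```python
def build_subsequence_automaton(s):
    """Straightforward construction: delta[i][c] is the state reached from
    state i on character c, i.e. one past the next occurrence of c at
    position >= i. Every state is accepting."""
    n = len(s)
    delta = [dict() for _ in range(n + 1)]
    nxt = {}
    for i in range(n - 1, -1, -1):
        nxt[s[i]] = i + 1
        delta[i] = dict(nxt)
    return delta


def accepts(delta, q):
    """Run the automaton on q; a missing transition means rejection."""
    state = 0
    for c in q:
        if c not in delta[state]:
            return False
        state = delta[state][c]
    return True


def accepts_with_defaults(s, q):
    """Chain automaton with default transitions.

    Returns (accepted, max_delay), where max_delay is the largest number of
    consecutive default transitions taken while processing one character.
    """
    state, max_delay = 0, 0
    for c in q:
        delay = 0
        while state < len(s) and s[state] != c:
            state += 1  # default transition: does not consume c
            delay += 1
        max_delay = max(max_delay, delay)
        if state == len(s):
            return False, max_delay
        state += 1  # regular transition: consumes c
    return True, max_delay
```

The first automaton answers each character with a single lookup but stores many transitions; the second stores only two transitions per state but may follow a long run of defaults.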
Sparse Regular Expression Matching
We present the first algorithm for regular expression matching that can take
advantage of sparsity in the input instance. Our main result is a new algorithm
that solves regular expression matching in time, where is the number of positions in
the regular expression, is the length of the string, and is the
density of the instance, defined as the total number of active states in
a simulation of the position automaton. This measure is a lower bound on the
total number of active states in simulations of all classic polynomial sized
finite automata. Our bound improves the best known bounds for regular
expression matching by almost a linear factor in the density of the problem.
The key component in the result is a novel linear space representation of the
position automaton that supports state-set transition computation in
near-linear time in the size of the input and output state sets.
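The notion of density can be made concrete with a plain state-set simulation that counts how many states are active over the whole run. The sketch below uses a hand-written position automaton for the expression (a|b)*ab and a naive transition step; it is not the linear space representation from the paper. Counting the initial state as active is an assumption of this sketch, as are the names `simulate` and `TRANSITIONS`.

```python
# Hand-written position automaton for (a|b)*ab.
# State 0 is the initial state; states 1-4 are the positions of a, b, a, b.
TRANSITIONS = {
    (0, "a"): {1, 3}, (0, "b"): {2},
    (1, "a"): {1, 3}, (1, "b"): {2},
    (2, "a"): {1, 3}, (2, "b"): {2},
    (3, "b"): {4},
}
START = 0
ACCEPTING = {4}


def simulate(transitions, start, accepting, text):
    """Return (matched, density): whether text is accepted, and the total
    number of active states summed over all steps of the simulation."""
    active = {start}
    density = len(active)  # assumption: the initial state set is counted
    for ch in text:
        active = {t for s in active for t in transitions.get((s, ch), ())}
        density += len(active)
    return bool(active & accepting), density


if __name__ == "__main__":
    print(simulate(TRANSITIONS, START, ACCEPTING, "abab"))
    print(simulate(TRANSITIONS, START, ACCEPTING, "abba"))
```

When few states are active at each step, the density is far below the product of the expression size and the string length, which is the situation a density-sensitive algorithm can exploit.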
Random Access in Persistent Strings and Segment Selection
We consider compact representations of collections of similar strings that
support random access queries. The collection of strings is given by a rooted
tree where edges are labeled by an edit operation (inserting, deleting, or
replacing a character) and a node represents the string obtained by applying
the sequence of edit operations on the path from the root to the node. The goal
is to compactly represent the entire collection while supporting fast random
access to any part of a string in the collection. This problem captures natural
scenarios such as representing the past history of an edited document or
representing highly-repetitive collections. Given a tree with nodes, we
show how to represent the corresponding collection in space and query time. This improves the previous time-space trade-offs
for the problem. Additionally, we show a lower bound proving that the query
time is optimal for any solution using near-linear space.
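As a baseline for what a random access query must compute, the following sketch materializes the string of a node by replaying the edits on its root-to-node path and then indexes into it. This takes time proportional to the path length and string length per query, which is exactly the cost the compact representation avoids. The assumptions that the root represents the empty string and that edits are `(operation, position, character)` triples are made for this illustration only.

```python
def apply_edit(s, edit):
    """Apply one edit operation (op, pos, ch) to the string s."""
    op, pos, ch = edit
    if op == "insert":
        return s[:pos] + ch + s[pos:]
    if op == "delete":
        return s[:pos] + s[pos + 1:]
    if op == "replace":
        return s[:pos] + ch + s[pos + 1:]
    raise ValueError(f"unknown operation: {op}")


def string_at(parent, edits, node):
    """Rebuild the string represented by node.

    parent maps each node to its parent (None for the root); edits maps each
    non-root node to the edit labeling the edge from its parent.
    Assumption: the root represents the empty string.
    """
    path = []
    while parent[node] is not None:
        path.append(edits[node])
        node = parent[node]
    s = ""
    for edit in reversed(path):
        s = apply_edit(s, edit)
    return s


def random_access(parent, edits, node, i):
    """Naive query: character i of the string represented by node."""
    return string_at(parent, edits, node)[i]
```

A tree with two children under the same node models two divergent versions of a document that share their earlier history.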
To achieve our bounds for random access in persistent strings we show how to
reduce the problem to the following natural geometric selection problem on line
segments. Consider a set of horizontal line segments in the plane. Given
an x-coordinate and a rank, a segment selection query returns the segment of
that rank in increasing order of y-coordinate among the segments crossing the
vertical line through the given x-coordinate. The segment selection
problem is to preprocess a set of horizontal line segments into a compact data
structure that supports fast segment selection queries. We present a solution
that uses space and supports segment selection queries in time, where is the number of segments. Furthermore, we prove
that this query time is also optimal for any solution using near-linear space.
Comment: Extended abstract at ISAAC 202
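The segment selection query itself is easy to state in code. The sketch below answers it by brute force, scanning and sorting all segments for every query, so it only pins down the semantics and says nothing about the data structure from the paper. The `(x_left, x_right, y)` segment encoding, the 1-based rank, and the function name are assumptions of this illustration.

```python
def segment_select(segments, x, rank):
    """Return the segment with the rank-th smallest y-coordinate among the
    segments crossing the vertical line at x, or None if there are fewer
    than rank such segments.

    segments is a list of (x_left, x_right, y) triples; rank is 1-based.
    """
    crossing = [seg for seg in segments if seg[0] <= x <= seg[1]]
    crossing.sort(key=lambda seg: seg[2])
    if 1 <= rank <= len(crossing):
        return crossing[rank - 1]
    return None


if __name__ == "__main__":
    segs = [(0, 10, 5), (2, 4, 1), (3, 8, 3), (6, 9, 2)]
    print(segment_select(segs, 3, 2))  # second lowest segment crossing x = 3
```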
Fast Dynamic Arrays
We present a highly optimized implementation of tiered vectors, a data
structure for maintaining a sequence of elements supporting access in time
and insertion and deletion in time for
while using extra space. We consider several different implementation
optimizations in C++ and compare their performance to that of vector and
multiset from the standard library on sequences with up to elements. Our
fastest implementation uses much less space than multiset while providing
speedups of for access operations compared to multiset and speedups
of compared to vector for insertion and deletion operations
while being competitive with both data structures for all other operations.
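The idea behind a tiered vector can be sketched with two levels: a list of fixed-capacity circular buffers in which every block except the last is kept full. Access is a direct index computation, and an insertion shifts elements inside one block and then moves a single element across each later block. This is a minimal Python illustration with a fixed block capacity, not the optimized C++ implementation evaluated in the paper; the class and method names are assumptions. In this sketch, choosing the block capacity close to the square root of the number of elements balances the in-block shifting against the number of blocks touched.

```python
class Block:
    """Fixed-capacity circular buffer."""

    def __init__(self, capacity):
        self.data = [None] * capacity
        self.head = 0
        self.size = 0

    def _slot(self, i):
        return (self.head + i) % len(self.data)

    def get(self, i):
        return self.data[self._slot(i)]

    def push_front(self, value):  # O(1)
        self.head = (self.head - 1) % len(self.data)
        self.data[self.head] = value
        self.size += 1

    def push_back(self, value):  # O(1)
        self.data[self._slot(self.size)] = value
        self.size += 1

    def pop_front(self):  # O(1)
        value = self.data[self.head]
        self.head = (self.head + 1) % len(self.data)
        self.size -= 1
        return value

    def pop_back(self):  # O(1)
        self.size -= 1
        return self.data[self._slot(self.size)]

    def insert(self, i, value):  # O(capacity): shift the tail right
        for j in range(self.size, i, -1):
            self.data[self._slot(j)] = self.data[self._slot(j - 1)]
        self.data[self._slot(i)] = value
        self.size += 1

    def delete(self, i):  # O(capacity): shift the tail left
        value = self.get(i)
        for j in range(i, self.size - 1):
            self.data[self._slot(j)] = self.data[self._slot(j + 1)]
        self.size -= 1
        return value


class TieredVector:
    """Two-level tiered vector; all blocks except the last are full."""

    def __init__(self, block_capacity=4):
        self.cap = block_capacity
        self.blocks = [Block(block_capacity)]
        self.n = 0

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return self.blocks[i // self.cap].get(i % self.cap)

    def insert(self, i, value):
        if not 0 <= i <= self.n:
            raise IndexError(i)
        b, off = divmod(i, self.cap)
        if b == len(self.blocks):
            self.blocks.append(Block(self.cap))
        block = self.blocks[b]
        full = block.size == self.cap
        carry = block.pop_back() if full else None
        block.insert(off, value)
        while full:  # cascade one overflowing element per later block
            b += 1
            if b == len(self.blocks):
                self.blocks.append(Block(self.cap))
            block = self.blocks[b]
            full = block.size == self.cap
            spill = block.pop_back() if full else None
            block.push_front(carry)
            carry = spill
        self.n += 1

    def delete(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        b, off = divmod(i, self.cap)
        value = self.blocks[b].delete(off)
        for k in range(b + 1, len(self.blocks)):  # refill from later blocks
            self.blocks[k - 1].push_back(self.blocks[k].pop_front())
        if len(self.blocks) > 1 and self.blocks[-1].size == 0:
            self.blocks.pop()
        self.n -= 1
        return value
```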
Distance labeling schemes for trees
We consider distance labeling schemes for trees: given a tree with nodes,
label the nodes with binary strings such that, given the labels of any two
nodes, one can determine, by looking only at the labels, the distance in the
tree between the two nodes.
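To illustrate what a distance labeling scheme is, the sketch below uses a deliberately naive labeling for unweighted rooted trees: each node's label is the sequence of node identifiers on its root-to-node path. The distance is then computed from the two labels alone via the length of their common prefix. Labels of this kind can be very long, so this is only an illustration of the problem, not one of the schemes from the paper; the function names and the `parent` dictionary encoding are assumptions.

```python
def make_labels(parent):
    """Label every node with the tuple of node ids on its root-to-node path.

    parent maps each node to its parent, with None for the root.
    """
    labels = {}

    def label(v):
        if v not in labels:
            p = parent[v]
            labels[v] = (v,) if p is None else label(p) + (v,)
        return labels[v]

    for v in parent:
        label(v)
    return labels


def distance(label_u, label_v):
    """Tree distance computed from the two labels only.

    The common prefix ends at the lowest common ancestor; the remaining
    entries of each label are the edges from that ancestor down to the node.
    """
    common = 0
    for a, b in zip(label_u, label_v):
        if a != b:
            break
        common += 1
    return (len(label_u) - common) + (len(label_v) - common)
```

The decoder `distance` never looks at the tree, only at the two labels, which is the defining property of a labeling scheme.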
A lower bound by Gavoille et al. (J. Alg. 2004) and an upper bound by Peleg
(J. Graph Theory 2000) establish that labels must use
bits\footnote{Throughout this paper we use for .}. Gavoille et
al. (ESA 2001) show that for very small approximate stretch, labels use
bits. Several other papers investigate various
variants such as, for example, small distances in trees (Alstrup et al.,
SODA'03).
We improve the known upper and lower bounds of exact distance labeling by
showing that bits are needed and that bits are sufficient. We also give ()-stretch labeling
schemes using bits for constant .
()-stretch labeling schemes with polylogarithmic label size have
previously been established for doubling dimension graphs by Talwar (STOC
2004).
In addition, we present matching upper and lower bounds for distance labeling
for caterpillars, showing that labels must have size . For simple paths with nodes and edge weights in , we show that
labels must have size
- …